
rabbit_msg_store: terminate GC with exit signal during shutdown#15498

Merged
lhoguin merged 2 commits into rabbitmq:main from amazon-mq:fix/msg-store-gc-stop-timeout
Apr 29, 2026
Conversation

@lukebakken
Collaborator

@lukebakken lukebakken commented Feb 17, 2026

Problem

When rabbit_msg_store shuts down, its terminate/2 callback calls rabbit_msg_store_gc:stop/1, which sends a gen_server2:call(stop, infinity) to the GC process. If the GC is blocked mid-handle_cast on disk I/O (for example, during compaction under disk pressure), the call blocks indefinitely. terminate/2 never reaches the code that writes the recovery files (file_summary.ets, msg_store_index.ets, clean.dot). Eventually the msg_store child's supervisor shutdown timeout expires (msg_store_shutdown_timeout, default 600s) and the supervisor kills the msg_store process. With no recovery files on disk, the next startup rebuilds indices from scratch by scanning every segment file, which is slow and expensive for large stores.

We hit this in production on a broker under PerfTest load. The persistent message store for the / vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason killed. The GC process was blocked on disk I/O while disk free space was hovering near the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via rabbitmqctl stop.

Fix

Replace the synchronous rabbit_msg_store_gc:stop/1 call in terminate/2 with a new stop_gc/2 helper that monitors the GC, sends exit(GcPid, shutdown), and waits for the 'DOWN' message.

rabbit_msg_store_gc does not trap exits, so an exit signal terminates it immediately even if it is running inside a handle_cast callback on disk I/O. Exit signals are processed by the scheduler, not the user-level receive loop, so the signal preempts the blocked callback. The GC's terminate/2 is a no-op, so bypassing it has no side effect.

The wait for 'DOWN' is bounded by max(msg_store_shutdown_timeout - 60_000, 5_000) so that terminate/2 stays within the msg_store child's own supervisor shutdown timeout and leaves at least 60s for the remaining steps (syncing the current file, writing the file summary, tearing down ETS, writing recovery terms). If the shutdown signal does not produce a 'DOWN' in time, fall back to exit(GCPid, kill). kill cannot be trapped, so the inner receive has no timeout. A warning is logged on the fallback path so the case can be tracked in production logs.
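The shutdown sequence described above can be sketched as follows. This is a paraphrase of the behavior described in this PR, not the merged code; the application-env lookup and the exact log wording are assumptions.

```erlang
%% Sketch of the stop_gc/2 helper described above. Names follow the
%% PR text; the env lookup and log message are assumptions.
stop_gc(GCPid, Dir) ->
    SupTimeout = application:get_env(rabbit, msg_store_shutdown_timeout, 600_000),
    %% Stay within the supervisor's shutdown timeout and leave ~60s
    %% for the remaining terminate/2 steps.
    Timeout = max(SupTimeout - 60_000, 5_000),
    MRef = erlang:monitor(process, GCPid),
    %% The GC does not trap exits, so this terminates it immediately,
    %% even if it is blocked inside a callback.
    exit(GCPid, shutdown),
    receive
        {'DOWN', MRef, process, GCPid, _Reason} ->
            ok
    after Timeout ->
        %% Should never happen by design; log so the fallback is
        %% searchable in production logs.
        logger:warning("Message store ~ts: GC process ~p did not stop "
                       "within ~bms, killing it", [Dir, GCPid, Timeout]),
        exit(GCPid, kill),
        %% kill cannot be trapped, so no timeout is needed here.
        receive
            {'DOWN', MRef, process, GCPid, _} -> ok
        end
    end.
```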

Killing the GC mid-operation is safe with respect to message data:

  • compact_file copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
  • truncate_file only removes data that has already been compacted to earlier offsets.
  • delete_file only deletes files with zero valid messages, enforced by assertions before the delete.

The unclean recovery path (build_index/3) rebuilds everything from the actual segment files on disk using scan_file_for_valid_messages, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files are fully consistent and the next startup recovers cleanly without a rebuild.

Tests

Two test cases in backing_queue_SUITE exercise the fix:

  • msg_store_gc_stuck_suspended suspends the GC with sys:suspend, terminates the persistent store via the supervisor, and verifies that the store recovers cleanly (successfully_recovered_state returns true) with all messages intact. This covers the case where the GC is blocked in its receive loop.
  • msg_store_gc_stuck_mid_callback mocks compact_file/2 to block indefinitely inside the handle_cast callback, sends a compact cast to the GC to put it in the blocking callback, then terminates the store via the supervisor. This covers the scenario that motivated the PR: a GC stuck on disk I/O inside a callback.

Both tests confirm that exit(GcPid, shutdown) terminates the GC process regardless of what Erlang code it is running, that terminate/2 completes and writes the recovery files, and that the next startup recovers cleanly.

@lukebakken
Collaborator Author

Leaving as draft until I can re-reproduce the issue without this fix, then verify this fix, in a "real" environment. Early reviews are welcome, of course 😸

@michaelklishin
Collaborator

@lhoguin can you please take a quick look? Thank you.

@lukebakken
Collaborator Author

No hurry; I am still working on this PR and testing it.

@lhoguin
Contributor

lhoguin commented Feb 23, 2026

Perhaps just change rabbit_msg_store_gc to do exit(GcPid, shutdown) instead of sending a message. It'll stop faster for everyone.

@lukebakken lukebakken force-pushed the fix/msg-store-gc-stop-timeout branch 2 times, most recently from e2d9e27 to 37e634c on April 15, 2026 at 21:55
@lukebakken
Collaborator Author

lukebakken commented Apr 15, 2026

Hi Loïc, the genie and I looked into this and I want to make sure I understand your suggestion correctly.

rabbit_msg_store_gc runs as a gen_server2, which sets trap_exit = true. We dug into the OTP signal handling code (erts/emulator/beam/erl_proc_sig_queue.c, OTP-26.2.5.11, around line 4266) and confirmed that when a process has F_TRAP_EXIT set, any exit signal except kill is converted to a message in the process mailbox:

if ((op != ERTS_SIG_Q_OP_EXIT || reason != am_kill)
    && (c_p->flags & F_TRAP_EXIT)) {
    convert_prepared_sig_to_msg(c_p, sig, xsigd->message, next_nm_sig);

So exit(GcPid, shutdown) doesn't immediately terminate the GC; it just puts a message in its mailbox, the same as the stop call does. If the GC is blocked mid-callback on disk I/O, the signal sits in the mailbox until the callback returns, which is exactly the scenario we're trying to fix.

Only exit(GcPid, kill) bypasses trap_exit and causes immediate termination, which is what the PR already falls back to after the timeout expires.

Am I missing something in your suggestion?


Addendum: The analysis above is incorrect. It was produced by Claude (an AI assistant) and I posted it without verifying the load-bearing claim. rabbit_msg_store_gc does not set trap_exit:

  • deps/rabbit/src/rabbit_msg_store_gc.erl contains no process_flag calls; init/1 does not set trap_exit.
  • deps/rabbit_common/src/gen_server2.erl has exactly one process_flag(trap_exit, true) call, and it is inside the spawned middleman process in do_multi_call/4 — not on the init_it/6 path used by normal gen_server2 processes.

So @lhoguin is right: exit(GcPid, shutdown) would terminate the GC immediately regardless of what it is doing inside a callback. Apologies for the noise — I'll reconsider the approach in this PR.
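The trap_exit distinction at the heart of this exchange can be demonstrated with a standalone module (a minimal sketch, unrelated to the RabbitMQ code): a non-trapping process is terminated by exit(Pid, shutdown) even while blocked, while a trapping process merely receives a mailbox message.

```erlang
%% Standalone demo, not RabbitMQ code.
-module(exit_demo).
-export([run/0]).

run() ->
    %% Non-trapping process blocked "forever": the shutdown signal
    %% terminates it without it ever returning to its receive loop.
    NonTrap = spawn(fun() -> timer:sleep(infinity) end),
    MRef = erlang:monitor(process, NonTrap),
    exit(NonTrap, shutdown),
    Res1 = receive {'DOWN', MRef, process, NonTrap, Reason} -> Reason end,
    %% Trapping process: the same signal is converted to an
    %% {'EXIT', From, shutdown} message, and the process keeps
    %% running until it chooses to act on it.
    Self = self(),
    Trap = spawn(fun() ->
                     process_flag(trap_exit, true),
                     Self ! ready,
                     receive {'EXIT', _From, R} -> Self ! {trapped, R} end
                 end),
    receive ready -> ok end,
    exit(Trap, shutdown),
    Res2 = receive {trapped, R2} -> R2 end,
    {Res1, Res2}.
    %% run() -> {shutdown, shutdown}
```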

@lukebakken lukebakken force-pushed the fix/msg-store-gc-stop-timeout branch 2 times, most recently from ec554c8 to b2fad2f on April 20, 2026 at 00:36
@lhoguin
Contributor

lhoguin commented Apr 20, 2026

Am I missing something in your suggestion?

I don't think it sets trap_exit? The only reference to trap_exit is in a temporary process to do multi-calls.

@lukebakken lukebakken force-pushed the fix/msg-store-gc-stop-timeout branch 4 times, most recently from ff3f430 to 9f4ffaf on April 20, 2026 at 22:28
@lukebakken lukebakken marked this pull request as ready for review April 20, 2026 22:28
@lukebakken
Collaborator Author

Thanks again @lhoguin for taking a look.

@lukebakken lukebakken changed the title from "rabbit_msg_store: use bounded timeout for GC stop during shutdown" to "rabbit_msg_store: terminate GC with exit signal during shutdown" Apr 20, 2026
@lukebakken lukebakken force-pushed the fix/msg-store-gc-stop-timeout branch 3 times, most recently from 380cea7 to 28dd04d on April 27, 2026 at 13:31
When `rabbit_msg_store` shuts down, its `terminate/2` callback calls
`rabbit_msg_store_gc:stop/1`, which sends a `gen_server2:call(stop,
infinity)`. If the GC process is blocked mid-`handle_cast` on disk
I/O (for example, during compaction under disk pressure), the call
blocks indefinitely. `terminate/2` never reaches the code that writes
the recovery files (`file_summary.ets`, `msg_store_index.ets`,
`clean.dot`). Eventually the msg_store child's supervisor shutdown
timeout expires (`msg_store_shutdown_timeout`, default 600s) and the
supervisor kills the msg_store process with reason `killed`. With no
recovery files on disk, the next startup rebuilds indices from
scratch by scanning every segment file, which is slow and expensive
for large stores.

This was observed in production on a broker under PerfTest load. The
persistent message store for the `/` vhost logged "Stopping message
store" and then nothing for 10 minutes until the supervisor killed
it. The GC process was blocked on disk I/O while disk free space was
hovering near the 2 GiB limit. On restart the store logged
"rebuilding indices from scratch" despite the shutdown having been
initiated gracefully via `rabbitmqctl stop`.

Replace the synchronous `rabbit_msg_store_gc:stop/1` call in
`terminate/2` with a new `stop_gc/1` helper that monitors the GC,
sends `exit(GCPid, shutdown)`, and waits for the `'DOWN'` message.

`rabbit_msg_store_gc` does not trap exits, so an exit signal
terminates it immediately even if it is running inside a
`handle_cast` callback on disk I/O. Exit signals are processed by
the scheduler, not the user-level receive loop, so the signal
preempts the blocked callback. The GC's `terminate/2` is a no-op, so
bypassing it has no side effect.

The wait for `'DOWN'` is bounded by
`max(msg_store_shutdown_timeout - 60_000, 5_000)` so that `terminate/2`
stays within the msg_store child's own supervisor shutdown timeout
and leaves at least 60s for the remaining steps (syncing the current
file, writing the file summary, tearing down ETS, writing recovery
terms). If the shutdown signal does not produce a `'DOWN'` in time,
fall back to `exit(GCPid, kill)`. `kill` cannot be trapped, so the
inner `receive` has no timeout.

Killing the GC mid-operation is safe with respect to message data:

- `compact_file` copies messages before updating the index, and the
  original data remains on disk until truncation. The code comments
  confirm: "it's OK if we crash at any point before we update the
  index because the old data is still there until we truncate."
- `truncate_file` only removes data that has already been compacted
  to earlier offsets.
- `delete_file` only deletes files with zero valid messages,
  enforced by assertions before the delete.

The unclean recovery path (`build_index/3`) rebuilds everything from
the actual segment files on disk using `scan_file_for_valid_messages`,
so any inconsistency between the file summary and the on-disk state
is handled. In the common case (GC killed before it modified the file
summary ETS), the recovery files are fully consistent and the next
startup recovers cleanly without a rebuild.

Add `rabbit_msg_store:gc_pid/1` to expose the GC pid for testing.

Add two test cases to `backing_queue_SUITE`:

- `msg_store_gc_stuck_suspended` suspends the GC with `sys:suspend`,
  terminates the persistent store via the supervisor, and verifies
  that the store recovers cleanly (`successfully_recovered_state`
  returns `true`) with all messages intact. This covers the case
  where the GC is blocked in its receive loop.
- `msg_store_gc_stuck_mid_callback` mocks `compact_file/2` to block
  indefinitely inside the `handle_cast` callback, sends a compact
  cast to the GC to put it in the blocking callback, then terminates
  the store via the supervisor. This covers the scenario that
  motivated the PR: a GC stuck on disk I/O inside a callback.

Both tests confirm that `exit(GcPid, shutdown)` terminates the GC
process regardless of what Erlang code it is running, that
`terminate/2` completes and writes the recovery files, and that the
next startup recovers cleanly.
The `exit(GCPid, shutdown)` signal is expected to terminate the GC
immediately because the GC does not trap exits. If it does not, the
fallback `exit(GCPid, kill)` path runs silently and the operator has
no visibility into a case that should, by design, never happen.

Emit a warning on the fallback path identifying the store directory
and the timeout that expired. This makes the fallback searchable in
fleet-wide logs so we can tell whether the bounded-wait was ever
actually needed in production.

Thread the `Dir` argument back into `stop_gc/2` for the log message.
@lhoguin lhoguin merged commit ee8ece2 into rabbitmq:main Apr 29, 2026
189 checks passed
@lhoguin
Contributor

lhoguin commented Apr 29, 2026

Thank you!!

lhoguin added a commit that referenced this pull request Apr 29, 2026
`rabbit_msg_store`: terminate GC with exit signal during shutdown (backport #15498)
lhoguin added a commit that referenced this pull request Apr 29, 2026
`rabbit_msg_store`: terminate GC with exit signal during shutdown (backport #15498) (backport #16266)
@lukebakken lukebakken deleted the fix/msg-store-gc-stop-timeout branch April 29, 2026 11:53